System and method for differentially locating and modifying audio sources

ABSTRACT

A system and method for differentially locating and modifying audio sources that includes receiving multiple audio inputs from a set of distinct locations; determining a multi-dimensional audio map from the audio inputs; acquiring a set of positional audio control inputs applied to the audio map, each audio control input comprising a location and an audio processing property; and generating an audio output according to the audio control inputs and the audio inputs. The audio control inputs are capable of configuration through manual, automatic, computer vision analysis, and other configuration modes.

CROSS-REFERENCE TO RELATED APPLICATIONS

This Application is a Continuation Application of U.S. patent application Ser. No. 17/192,101, filed on 4 Mar. 2021, which is a Continuation Application of U.S. patent application Ser. No. 16/803,692, filed on 27 Feb. 2020, and granted on 6 Apr. 2021 as U.S. Pat. No. 10,970,037, which is a Continuation Application of U.S. patent application Ser. No. 16/528,534, filed on 31 Jul. 2019, and granted as U.S. Pat. No. 10,613,823, which is a Continuation Application of U.S. patent application Ser. No. 15/717,753, filed on 27 Sep. 2017, and granted as U.S. Pat. No. 10,409,548, which claims the benefit of U.S. Provisional Application No. 62/400,591, filed on 27 Sep. 2016, all of which are incorporated in their entirety by this reference.

TECHNICAL FIELD

This invention relates generally to the field of audio manipulation, and more specifically to a new and useful system and method for differentially locating audio modifications to associate with sources.

BACKGROUND

An environment with multiple audio sources can complicate communications between different agents. A person may want to listen, carry on a conversation, or otherwise engage with one or more people, but various sources of noise can detract from hearing others or being heard. This is particularly true for the hearing impaired, who rely on hearing aids and other devices to assist with hearing. Similarly, at a presentation or live performance, there may be various sources of sound that detract from following the main focus. In some meeting situations, multiple conversation threads may be going on with different levels of importance, intended for different scopes of audiences. Noisy environments may additionally hamper the ability of a computing system to use audio interfaces. In particular, audio interfaces are generally limited to directly listening to a controller speaking into a microphone device—this setup may not be suitable for every situation.

Thus, there is a need in the audio manipulation field to create a new and useful system and method for differentially locating audio modifications such that they differentially apply to different audio sources. This invention provides such a new and useful system and method.

BRIEF DESCRIPTION OF THE FIGURES

FIGS. 1-3 are schematic representations of a system of a preferred embodiment;

FIG. 4 is a schematic representation of generating an audio map from multiple audio inputs;

FIG. 5 is an exemplary representation of a control interface;

FIG. 6 is a schematic representation of tracking an audio source;

FIG. 7 is a schematic representation of extracting an audio source and generating an audio output;

FIG. 8 is a flowchart representation of a method of a preferred embodiment;

FIG. 9 is a schematic representation of a method representing an audio map in a user interface for manual configuration;

FIGS. 10 and 11 are schematic representations of variations using an automatic configuration mode;

FIGS. 12 and 13 are schematic representations of variations using a computer vision based configuration mode;

FIGS. 14A and 14B are schematic representations of a variation using group attention detection;

FIG. 15 is a schematic representation of a variation using a person environment orientation configuration mode; and

FIG. 16 is a schematic representation of applying the audio output to an audio user interface.

DESCRIPTION OF THE EMBODIMENTS

The following description of the embodiments of the invention is not intended to limit the invention to these embodiments but rather to enable a person skilled in the art to make and use this invention.

1. Overview

A system and method for differentially locating and modifying audio sources of a preferred embodiment generates a positional map of audio sources and applies active audio modifications to audio sources based on the mapped position of each identified audio source. Audio modifications include amplification, deamplification, equalization, filtering, compression, isolation, synthesizing to a second audio stream, combining audio sources, and/or any number of other audio modifications. This can be used to apply customized handling of distinct audio sources within an environment. The system and method are preferably applied for generating audio for one user, but can additionally be used for generating multiple customized audio outputs for different audiences (e.g., multiple users).

In one preferred application, the system and method may be used in generating an audio stream that is composed of selected audio sources detected within the environment. Herein, an audio source refers to audio characterized as originating from some source such as a person, speaker, machine, or other suitable source of sound. The audio sources are preferably extracted into an audio data representation of the audio source. The audio processing of audio sources may additionally be individually controlled. More specifically, customized audio adjustments and processing can be applied to audio signals originating from a location or region. Some locations may be amplified and other locations may be deamplified. In this way, one exemplary application may be to minimize audio from one set of sources and to amplify or enhance audio from a second set of sources.

The system and method preferably expose a positional map of audio, which may be used in enabling various approaches for setting and modifying generated audio. Audio control inputs are settings used in configuring audio generation. The audio control inputs can be set for different locations of the audio map and preferably for particular audio sources. Some preferred modes of controlling generated audio can include manual selection within a graphical representation of sources, audio content analysis, computer vision (CV) based analysis of the environment, agent environment orientation (e.g., person position and/or body directed orientation), and applying preconfigured settings based on detected conditions.

In one variation of enabling manual selection, the system and method can provide a visual representation of the audio landscape around a user. More specifically, the system and method can offer a user interface that can be used to control generated audio through specifying audio control inputs for different locations or regions in the environment. As an exemplary implementation, a user can use a smart phone application to view a graphical representation of an audio map near the user. The user can then use simple actions to set different positional audio control inputs for different locations or regions on the audio source map. The audio control inputs can specify a location/region and types of audio modifications such as how to adjust the volume and/or equalize the audio originating from that location. The set of positional audio control inputs is then used to generate/augment an audio signal that drives a personal audio system. For example, a user may use her smart phone to selectively mark different locations to amplify her friends sitting around her while reducing the volume of the background music and others sitting nearby, but who are not of interest. As another example, a user may choose to reduce the low frequency signals from a human speaker or selectively amplify just the frequencies used in human speech.

In a variation enabling automatic setting of audio control inputs, audio control inputs may be automatically set and adjusted dynamically based on analysis of the audio, audio sources, audio content, and/or the audio map. For example, detection of audio sources may be achieved through analysis of the audio map, and then isolated audio from each audio source may be analyzed to determine its significance, which determines how its individual volume is set during generation of a compiled audio output. In another example, speech content and, in particular, speech content detected to be part of a conversation may be enhanced, while music, machine noise, background chatter, and/or other audio sources can be classified as such and deamplified from the audio output.

In a variation using supplementary environment sensing, an imaging system with a computer vision (CV) based monitoring system can generate CV-based interpretations of the environment that can be used in setting audio control inputs. Visual detection and classifications of objects can be mapped to audio sources, and visually detected details like the direction of attention of one or more people can be used in setting the audio control inputs. For example, the audio output generated for a user may be at least partially based on the visual detection of where the user is directing his or her attention. As a related variation, alternative sensing systems can be used in addition to or in place of a CV-based system. In one variation, sensing of an agent's environment orientation may be used in setting audio control inputs. The position and/or directional orientation of an agent (e.g., a person) can be used to set audio control inputs.

The system and method can have applications in various forms of environments. In one preferred use case, the system and method are used in facilitating assisted conversations in environments with multiple conversations. In particular, the system may be used by attendees at presentations, performances, and conferences to enhance their ability to communicate with different people at different times. For example, a user could selectively transition between targeting their listening to a presenter and to a group of colleagues sitting nearby.

In another use case, the system and method are used in combination with a listening device (e.g., a hearing assistance app, a hearing aid, cochlear implant, etc.). In one implementation, a listening device can include an operating mode that can use positional audio control inputs specified by a connected user application to augment the audio of the hearing aids. Furthermore, other users (who may or may not use a similar listening device) can use a connected application or recording device to collect audio input from different locations to facilitate generation of an audio map.

In another use case, the system and method are used in an environment that provides public ambient computing. This will generally be where a computing system is interested in collecting isolated audio input from distinct groups of people. In places of commerce, this may be used to enable people within a store to issue audio commands that are uniquely associated with that particular person. For example, an environment equipped with a CV-based monitoring system such as the one described in U.S. patent application Ser. No. 15/590,467, filed 9 May 2017, which is hereby incorporated in its entirety by this reference, may include microphones integrated with the imaging system that are distributed through the environment, and the system and method may then be used for detecting audio commands associated with distinct users. As another example, a store may use this within their store so that workers can be enabled to issue audio commands without wearing any personal audio recording equipment. General audio recording of the environment, possibly coupled with CV-based monitoring of the environment, can be used to pick up and isolate audio sources from workers.

As one potential benefit, the system and method may enable various complex forms of layered audio processing within complex audio environments. The system and method can preferably address some of the challenges present in environments where there are multiple, competing, and overlapping audio sources.

As a related potential benefit, the system and method may be used to generate different audio outputs that use customized processing of audio sources in an environment. For example, within one environment and from a shared set of audio inputs, distinct audio outputs may be generated, where the distinct audio outputs may be compiled from unique sets of audio sources. In other words, a first and second user can each customize whom they want to listen to independent of the other user.

As another potential benefit, the system and method may enable a variety of modes of controlling generated audio output. In some variations, various sensing and content analysis may be used to drive different forms of audio control. As mentioned, audio content analysis and/or CV-based environment monitoring may be used to augment the generation of audio output.

2. System for Differentially Locating and Modifying Audio Sources

As shown in FIG. 1, a system for differentially locating and modifying audio sources of a preferred embodiment can include a set of microphones 110, a spatial audio analyzer 120, an audio control configuration system 130, and an audio generator 140 with an audio source extraction engine 150. The set of microphones 110 is configured to record or otherwise collect audio signals from a set of distinct locations. The spatial audio analyzer 120 is configured to produce a multi-dimensional audio map from audio input recorded by the set of microphones 110. The audio control configuration system 130 is preferably configured to acquire and set one or more positional audio control inputs that define audio processing for one or more locations of the audio map. An audio generator 140 is preferably configured to generate at least one audio output based on audio input collected from the set of microphones. The audio generator can include the audio source extraction engine 150, which is configured to extract or substantially isolate at least one audio source that is used in compiling the audio output. In preferred instances, the audio output is a compiled set of layered audio sources.

The set of microphones 110 functions to acquire multiple audio inputs from a set of distinct locations. The set of microphones 110 can operate as a microphone array with at least two microphones, and preferably three or more, that collect audio recordings from different vantage points. The microphones 110 in combination with the spatial audio analyzer 120 generate a map of audio sources (i.e., an audio map). The audio map is preferably a two-dimensional map but may alternatively be three-dimensional or represented in any suitable format.

The microphones 110 can be omnidirectional microphones but may alternatively be directional microphones. Orientation of directional microphones may be used to facilitate locating audio source position.

In one implementation, the set of microphones 110 is a set of microphones that are mechanically coupled through a structure. Each microphone serves as a distinct audio sensor that is positioned within the environment. For example, a speakerphone in a conference room may have a set of microphones at distinct points. In another example, microphones may be installed in various parts of the environment. When the system includes a CV-based monitoring system, microphones may be integrated into the imaging devices distributed across an environment. In another exemplary implementation, the set of microphones 110 can be a microphone array unit that is a telescoping device that a user could carry and easily extend and set on a surface when the user wants to use the system, as shown in FIG. 2.

In another implementation, the set of microphones includes a distributed set of personal computing devices capable of recording audio. The distributed recording device variation can preferably enable ad-hoc implementation of the system in different environments. In one exemplary implementation, multiple smart phones of different users could act collectively as a microphone array as shown in FIG. 3. In another exemplary implementation, the microphones integrated into hearing aids and worn in or on a user's ears may serve to collect audio input.

The relative positioning of the microphones is preferably known. Alternatively, approximate relative positioning may be specified through a user interface. For example, a user may mark where connected microphones are located. As another alternative, a calibration process could be used to approximate relative positioning. This calibration phase might include calibration pings that come from one source at a time, or rely instead on the passive background audio. As yet another alternative, microphones can be paired with speakers, as is naturally the case with smartphones, to facilitate the automatic calibration process. In another variation, CV-based monitoring of the environment can detect and/or predict locations of microphones and their relative positioning. In the case where a microphone is coupled to an imaging device, image mapping of overlapping areas of imaging data or 3D reconstruction of the space may be used to approximate the relative positioning of the imaging devices and thereby the relative positioning of the audio microphones.
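As an illustrative sketch of the calibration ping variation (a non-limiting example: the speed-of-sound constant, the assumption of synchronized device clocks, and all function names are assumptions of this sketch rather than requirements of the system), pairwise device distances can be estimated from ping travel times, and relative 2D positions recovered with classical multidimensional scaling:

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed)

    def pairwise_distances(emit_times, arrival_times):
        """Distance from each pinging device i to each listening device j,
        assuming synchronized clocks: d[i][j] = c * (arrival - emit)."""
        emit = np.asarray(emit_times)[:, None]
        arrive = np.asarray(arrival_times)
        return SPEED_OF_SOUND * (arrive - emit)

    def relative_positions(dist):
        """Classical multidimensional scaling: recover 2D coordinates (up
        to rotation and reflection) from a symmetric distance matrix."""
        n = dist.shape[0]
        j = np.eye(n) - np.ones((n, n)) / n      # centering matrix
        b = -0.5 * j @ (dist ** 2) @ j           # double-centered Gram matrix
        w, v = np.linalg.eigh(b)
        top = np.argsort(w)[::-1][:2]            # two largest eigenvalues
        return v[:, top] * np.sqrt(np.maximum(w[top], 0.0))

The recovered layout is relative only; absolute placement or orientation would come from the user interface or CV-based variations described above.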

The spatial audio analyzer 120 functions to generate an audio map from multiple audio inputs. The spatial audio analyzer 120 preferably uses one or more forms of acoustic source localization. The spatial audio analyzer 120 preferably includes configuration to analyze the relative delay in arrival time of audio signals detected from the collected audio inputs, and uses such time delay phase shift detection to produce a map of audio sources in space.

Multiple audio sources may be recorded in the multiple audio inputs with different phase shifts depending on the positioning of the audio source relative to the respective microphones. In one implementation, audio feature detection can be applied to detect different audio features indicative of an audio source. The time delay phase shift of audio features between different audio inputs can be calculated and then used in triangulating a location. Various forms of signal processing, machine learning, and/or other forms of analysis may be used in mapping subcomponents of an audio input to different locations. As one example, a frequency domain signal of the audio inputs may be analyzed in detecting and segmenting audio features that share common time differences. As shown in the simplified scenario of FIG. 4, audio generated at different points will be recorded as overlapping signals. Audio feature detection can be used across the audio inputs and the time delay in the different audio inputs used in calculating location. An audio map can be generated by performing such a process across multiple audio features. In another implementation, a deep learning model trained on locating audio sources may be used. Other suitable audio locating processes may additionally or alternatively be used.
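As a minimal sketch of the time delay calculation (illustrative only; the equal-length, synchronized input signals and the sample rate parameter are assumptions of this example), the delay of a shared audio feature between two audio inputs can be estimated from the peak of their cross-correlation:

    import numpy as np

    def estimate_delay(sig_a, sig_b, sample_rate):
        """Seconds by which sig_b lags sig_a, found at the peak of the
        full cross-correlation (signals assumed equal length)."""
        corr = np.correlate(sig_b, sig_a, mode="full")
        lag = int(np.argmax(corr)) - (len(sig_a) - 1)  # lag in samples
        return lag / sample_rate

Each such delay between a pair of microphones at known positions constrains the source location; combining delays across three or more audio inputs permits triangulating the location, as in FIG. 4.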

As mentioned, the audio map is preferably two-dimensional but may alternatively be three-dimensional or represented in any suitable format. The spatial audio analyzer uses differences in the audio signals, triangulation, and time of flight of sound to approximate the location of different audio components in the audio inputs. The resulting audio map is preferably a time-series data representation of audio as generated across the multi-dimensional space.

In some alternative implementations, the spatial audio analyzer 120 may generate alternative forms of audio maps. For example, a radial view of the directionality of audio sources may be an alternative form of an audio map. A radial view may be suitable in situations where a microphone array is positioned on a table and the expected sources of audio originate around the microphone array at varying distances. Other alternative forms of a map can similarly be used, which would generally be customized to the use case and/or configuration of the microphones.

The audio control configuration system 130 functions to enable the customization of the audio output by setting positional audio control inputs to the audio map. An audio control input is preferably a configuration that characterizes how audio is to be extracted from the audio input and modified into an audio output. The generation of audio output by the audio generator 140 will preferably involve extracting audio from locations indicated by the audio control inputs, thereby forming distinct audio source channels, and then optionally combining and/or processing the audio source channels.

An audio control input preferably includes a location property and one or more audio processing operator properties. The location property can be a distinct point, but may alternatively be a region. In some variations, it may additionally be mapped to a particular source in addition to location. For example, some implementations may use a CV-based monitoring system to determine that a machine and a person are in the same general location, and two audio control inputs may be assigned to that same general location but individually assigned to audio generated by the machine (e.g., noise) and audio generated by the person (e.g., speech).

The audio processing operators can be various transformations that are to be applied to the audio source channel isolated or extracted on behalf of the audio control input. Audio processing operators may include amplification, deamplification, audio equalization, filtering, compression, speed changes, pitch shifts, synthesizing to a second audio stream (e.g., translating to another language and then generating speech audio in the other language), or other suitable audio effects. For audio sources of interest, the audio processing operator can enhance the audio from that audio source. Audio processing operators may not always be constructive enhancements. As one example, ambulance drivers may use this system to reduce the noise coming from their own siren so as to better hear each other and their passengers. This reduction may rely on location, frequency, or some other form of source identification. An audio control input may additionally characterize other operational features such as enabling recording, speech to text, censoring or content monitoring (e.g., content analysis and blocking of illicit content), streaming to another destination, or any suitable type of analysis or processing. An audio control input can have any suitable number of processing operators that are configured to be applied in any suitable configuration. In some instances, the audio source channel after extraction is not additionally processed.
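One possible representation of such an audio control input is sketched below (the class layout, the region radius default, and the gain helper are illustrative assumptions; the system does not require this structure):

    from dataclasses import dataclass, field
    from typing import Callable, List, Optional, Tuple

    import numpy as np

    # An operator transforms one extracted audio source channel.
    AudioOperator = Callable[[np.ndarray], np.ndarray]

    def gain(db: float) -> AudioOperator:
        """Amplification (positive dB) or deamplification (negative dB)."""
        factor = 10.0 ** (db / 20.0)
        return lambda samples: samples * factor

    @dataclass
    class AudioControlInput:
        location: Tuple[float, float]          # point on the 2D audio map
        radius: float = 0.5                    # optional region around it
        operators: List[AudioOperator] = field(default_factory=list)
        source_id: Optional[str] = None        # e.g., a tracked person or machine

        def process(self, channel: np.ndarray) -> np.ndarray:
            """Apply the configured operator chain to an extracted channel."""
            for op in self.operators:
                channel = op(channel)
            return channel

Under these assumptions, the ambulance scenario above could be expressed as AudioControlInput(location=siren_position, operators=[gain(-30.0)]), where siren_position is a hypothetical mapped location.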

The audio control configuration system 130 may offer one or more of a variety of configuration modes such as a manual configuration mode, an automatic configuration mode, a CV-based configuration mode, and/or a positioning configuration mode. The various modes may additionally be used in combination. Settings of a user for an audio control configuration system 130 may be initialized for every new environment. Alternatively, settings may be stored as part of a profile so that they can be automatically enabled. Settings may be stored for a particular room/environment, location, condition, speaker, person, audio source, or other suitable scope.

In a variation with a manual configuration mode, the audio control configuration system 130 can enable manual setting and editing of audio control inputs. A manual configuration mode preferably includes a graphical audio control interface configured to set different audio control inputs at different locations of a representation of an audio map. The audio control inputs can be graphically and interactively mapped to locations of the audio map in a user interface.

A manual configuration mode of the audio control configuration system 130 preferably includes a control application operable on a personal computing device or any suitable computing device. The control application can be an app, web application, or other suitable interface. The control application may alternatively be part of a dedicated controller. For example, the system may include hearing aids with a physical remote control that includes the control application.

The control application preferably graphically represents the map of audio sources as shown in FIG. 5. In one implementation, the control application can display a heat map where color maps to the amplitude or frequency profile of audio originating from a particular point or region. In this way, audio sources may be graphically detectable in a visual manner. In one variation, the map can display a live representation of the audio sources so that fluctuations in amplitude may facilitate the user determining what audio maps to what space in the graphic. In another implementation, the control application may represent distinct sources of audio with a static or animated graphic, or one representing richer information. For example, automatic classification of an audio source may enable audio sources to be represented by graphical classifications (e.g., person, audio system, noise source, etc.).

The graphical representation can additionally be synchronized to the position and orientation of the control application. In one variation, the graphical representation of the audio map may be overlaid or synchronized with real-world imagery to provide an augmented reality view of audio sources. When coupled with an image-stabilized, heads-up display, audio modification may be associated with objects in the visual rendering. Alternatively, image data collected from an imaging device may be able to provide a visual representation of the environment. For example, a surveillance system installed in an environment may be used to create an overhead 2D view of the environment with an overlaid audio map. An accelerometer, gyroscope, or compass (electronic or otherwise) could also be used to rotate and translate the visual representation appropriately as the viewing screen is moved.

The control application can additionally include some user interaction that supports adding, customizing, and/or removing positional audio control inputs. The audio control inputs can preferably be added to a representation of an audio map. A positional audio control input characterizes the desired audio modifications to be applied to audio originating at or near a particular position. A user may be able to set amplification, deamplification, audio equalization, filtering, compression, speed changes, pitch shifts, synthesizing to a second audio stream (e.g., translating to another language and then generating speech audio in the other language), or other suitable audio effects.

In one implementation, a user can tap a location on the audio map to set a point and then use pinching and spreading gestures on a touch device to decrease or increase the volume at that point. Additional configuration gestures could also be supported to set audio equalization or other modifications. Other forms of touch input such as sideways pinches, double and triple taps, pressure-sensitive presses, and/or other forms of inputs may be mapped to how audio processing operators are set for an audio control input.

The location of an audio control input may be manually set. For example, a user may be able to tap at any location to specify a new audio control input that is applied to audio associated with that region. In a preferred usage scenario, a user will self-identify probable locations of an audio source and set the location of an audio control input to that location. Alternatively, automatic detection of audio sources may be performed in setting the location, with the audio processing operators configured or edited by a user. A user interface may limit setting of audio processing operators to the audio control inputs instantiated for automatically detected audio sources.

In some variations, the location of an audio control input can track an audio source. When an audio control input is established, the audio properties at that position may provide a set of characteristics that can be used to track motion as shown in FIG. 6. The audio properties can include the magnitude, frequency distribution, pitch center, automatic classifications (e.g., music, voice, etc.), and/or other properties. This set of characteristics can be used, for example, to identify a person of particular interest to the user, track her position, and differentially modify her voice to make it easier for the user to discern. Additionally or alternatively, a lock on an audio source may track the audio source through incremental changes in location. Furthermore, audio sources can be tracked using other supplementary sensing systems such as a global positioning system, a local positioning system, RFID, near field communications, computer vision, lidar, ultrasonic sensing, sonic, and/or other suitable tracking systems.
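A minimal sketch of such tracking follows, assuming a per-frame list of detected sources with simple feature vectors (e.g., magnitude and frequency-distribution statistics) and an assumed maximum step-size heuristic:

    import numpy as np

    def update_tracked_location(prev_location, fingerprint, candidates,
                                max_step=1.0):
        """Move the control input to the detected source whose features
        best match the stored fingerprint, ignoring implausible jumps.

        candidates: list of (location, feature_vector) pairs from the
        latest audio map frame.
        """
        best, best_score = prev_location, np.inf
        for loc, feats in candidates:
            if np.linalg.norm(np.subtract(loc, prev_location)) > max_step:
                continue                       # too far to be the same source
            score = np.linalg.norm(np.subtract(feats, fingerprint))
            if score < best_score:
                best, best_score = loc, score
        return best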

In a variation with an automatic configuration mode, the audio control configuration system 130 can enable audio control inputs to be partially or fully configured automatically. As described above, automatic configuration may be used to supplement other configuration modes, such as locating and tracking of an audio control input. Audio sources can be configured to be detected through analysis of the audio map. The audio sources may then be classified or identified. This can be used in distinguishing users, speakers, and/or other sources of sound. Users may additionally be identified or classified. For example, speaker detection may identify a user.

Audio processing operators of an audio control input may be automatically set through an automatic configuration mode. Different audio processing operators may be set based on various factors such as: audio source classification; audio source location; relative positioning of audio sources, subjects of the audio output, and/or other objects in the environment; and/or analysis/content of an audio source.

In one variation, the audio from the audio source can be extracted through the audio source extraction engine 150 and then processed to determine subsequent processing before possible use in the audio output. In one variation, the audio content can be classified. Audio content that is classified as speech may additionally be converted to text and then various forms of content analysis can be performed. For example, the significance of the content may determine whether that audio source is amplified or deamplified/muted. In another example, the content of multiple speakers may be analyzed to determine other content that is part of a shared conversation so that a listener can follow a conversation while reducing the interference of a competing conversation. Analysis of multiple audio sources may alternatively be used for any suitable form of scenario analysis in determining how audio processing should be applied for an audio output.

In a variation with a CV-based configuration mode, the audio control configuration system 130 can use image-based monitoring to set or augment the setting of one or more audio control inputs. A system with a CV-based configuration mode preferably includes a CV-based monitoring system 160 with access to an imaging system. The CV-based monitoring system 160 is preferably used in performing image analysis to generate an interpretation that can be used in setting of an audio control input.

As one aspect, the CV-based monitoring system 160 may provide a supplementary sensing mechanism for identifying audio sources. In one variation, an imaging system in connection with the CV-based monitoring system 160 is employed to locate potential sources and then to associate audio source channel streams with those sources. A source might be a person, loudspeaker, machine, or another object. The various sources can be identified and counted as potential audio sources that can be assigned an audio control input. Visually detected potential audio sources may also be used in generating the audio map. For example, visually identified potential audio sources and their visually identified locations may provide a starting point for isolating audio source channels from a set of audio inputs. In another variation, biometric visual identification of people can be performed, and audio control inputs that were pre-configured for that person can be assigned.

As another aspect, the CV-based monitoring system 160 may detect the direction of attention. For example, the CV-based monitoring system 160 can detect where a subject is looking and set the audio control inputs so that the audio corresponds to the audio source(s) receiving attention from the subject. Attention in one implementation can include body, head, and/or gaze direction. In a similar variation, the attention of multiple subjects can be detected, and then the combined group attention may be used. This can be used in a conference setting to transition between people who have the focus of a group of people.

A CV-based monitoring system 160 may function to process and generate conclusions from one or more sources of image data. The CV-based monitoring system 160 can provide: person detection; person identification; person tracking; object detection; object classification (e.g., product identification); object tracking; extraction of information from device interface sources; gesture, event, and/or interaction detection; scene description; and/or any suitable form of image data analysis using computer vision and optionally other processing techniques. The CV-based monitoring system 160 is preferably used to drive CV-based applications of an interaction platform. In one exemplary scenario of CV-based commerce, the CV-based monitoring system 160 may facilitate generation of a virtual cart during shopping, tracking inventory state, tracking user interactions with objects, controlling devices in coordination with CV-derived observations, and/or other interactions. The CV-based monitoring system 160 will preferably include various computing elements used in processing image data collected by an imaging system. In particular, the CV-based imaging system is configured for detection of agents (e.g., people, robotic entities, or other entities that may be a subject of interest) and generation of a virtual cart based on interactions between people and products. Other suitable CV-based applications may alternatively be used such as security monitoring, environment analytics, or any suitable application. In some variations, the CV-based monitoring system 160 may exclusively be used in supplementing the operation of the system.

A CV-based monitoring system 160 will include or have access to at least one form of an imaging system that functions to collect image data within the environment. The imaging system preferably includes a set of image capture devices. The imaging system might collect some combination of visual, infrared, depth-based, lidar, radar, sonar, ultrasound reflection, and/or other types of image data. The imaging system is preferably positioned at a number of distinct vantage points. However, in one variation, the imaging system may include only a single image capture device. The image data is preferably video but can alternatively be a set of periodic static images. In one implementation, the imaging system may collect image data from existing surveillance or video systems. The image capture devices may be permanently situated in fixed locations. Alternatively, some or all may be moved, panned, zoomed, or carried throughout the facility in order to acquire more varied perspective views. In one variation, a subset or all of the imaging devices can be mobile cameras (e.g., wearable cameras or cameras of personal computing devices). For example, the imaging system can be an application using the camera of a smart phone, smart glasses, or any suitable personal imaging device.

In a variation with a positioning configuration mode, the audio control configuration system 130 can enable other forms of position and/or orientation sensing to drive the dynamic setting of audio control inputs. In a positioning variation, the system can additionally include a person orientation sensing system 170 that functions to detect location, direction, and/or orientation of one or more subjects. Location, direction, and/or orientation may be used in a manner similar to the CV-based monitoring system 160 above, where the “attention” of one or more subjects can drive setting of audio control inputs, and where “attention” is indicated by where the subjects direct their body. In one variation, an inertial measurement unit with accelerometers, gyroscopes, and/or a magnetometer may be coupled to a user and used in detecting the direction and possibly approximating the location of a subject. In another variation, a GPS or local positioning system (RF triangulation/beaconing) may be used in getting the location of a subject in an environment. This may be used in dynamically increasing volume for audio sources in close proximity to a subject and reducing or muting volume of audio sources outside of close proximity.

The audio generator 140 functions to generate an audio output according to the positional audio control inputs. The audio generator 140 more specifically modifies audio of the audio inputs into at least one audio output. The audio generator 140 can be configured to extract and substantially isolate audio sources for each location of an active audio control input, optionally apply audio processing operators if specified, and combine the isolated audio sources into an output. Multiple, custom audio outputs may also be generated from the same set of audio inputs.

The audio generator 140 may be integrated with or cooperate with the spatial audio analyzer 120. Preferably, the audio generator 140 includes an audio source extraction engine 150 that functions to extract and preferably substantially isolate an audio source in an audio data representation. The recorded audio inputs can preferably be broken down into audio source channels associated with different locations. The audio generator preferably operates on the audio signals recorded by the microphone array to produce an output audio signal.

The audio source extraction engine 150 can preferably facilitate applying phased alignment processing and/or the combining of multiple audio sources. The audio source extraction engine 150 may operate in combination with the audio generator, the spatial audio analyzer, and/or independently. As shown in the example of FIG. 7, three audio inputs may record the audio from two sources. The two audio sources can be extracted, processed, and mixed. In another exemplary scenario, two audio sources may be extracted and enhanced with audio processing, a third audio source attenuated (e.g., “muted”), and then the two audio sources combined to form the audio output. Attenuation, amplification, filtering, and other modifications can be applied independently or collectively to appropriate audio components. Additionally, constructive or destructive interference can be applied to the audio components or signals.

In one variation, the audio generator 140 produces an audio output that can be played by any suitable audio system. Alternatively, the system may include an audio system (e.g., speaker system). In one example, the system can be integrated with listening devices (e.g., hearing aids), wired headphones, wireless headphones, and/or other suitable types of personal audio systems.

3. Method for Differentially Locating and Modifying Audio Sources

As shown in FIG. 8, a method for differentially locating and modifying audio sources of a preferred embodiment includes receiving multiple audio inputs from a set of distinct locations S110; determining a multi-dimensional audio map from the audio inputs S120; acquiring a set of positional audio control inputs applied to the audio map, each positional audio control input comprising a location and audio processing property S130; and generating an audio output according to the positional audio control inputs and the audio inputs S140. The method may additionally implement one or more configuration modes such as manual configuration, automatic configuration, semi-automatic configuration, CV-based configuration, agent sensing based configuration, and/or any suitable form of configuration. The method is preferably implemented by a system described above, but the method may alternatively be implemented by any suitable system. The method may be applied to a variety of use cases such as custom personal audio control, enhanced conferencing or meeting recording, enhanced hearing assistance, enabling parallel audio interfaces in an environment, and/or any suitable application of the method.

Block S110, which includes receiving multiple audio inputs from a set of distinct locations, functions to use an array of microphones to record or otherwise sense audio signals. The distinct locations may be arranged to enable suitable locating of audio sources. In some variations, the detection and extraction of audio sources may be biased in particular directions or regions. The position of the microphones can additionally be critical in determining a multi-dimensional audio map used in identifying and isolating audio sources.

In one variation, receiving multiple audio inputs comprises recording audio from a microphone array that comprises distinct microphones positioned within an environment. The distinct microphones are preferably static or otherwise rigidly mounted at different points. For example, the microphones may be integrated into an imaging system collecting image data of the environment. In another implementation, the microphones may be physically coupled, which may enable the relative positioning to be substantially known based on the structure to which the microphones are coupled. For example, a microphone array device may include a frame that ensures consistent positioning and orientation for the set of microphones.

In another variation, receiving multiple audio inputs comprises recording audio from a distributed set of computing devices and communicating the audio input from each personal computing device to a processing system. This variation can enable personal computing devices such as a smart phone, smart glasses, personal computer, or other suitable computing device to be used in establishing an audio map. Preferably, an application facilitates the cooperative recording of audio input from multiple locations. The recorded audio of each device can then be communicated to a central location where the set of audio inputs can be processed in combination.

The method may additionally include calibrating audio input positioning used in determining relative positioning between different recording devices. When integrated with an imaging system, the position of the microphones may be detected visually. In one implementation where microphones are coupled to image capture devices, visual interpretation of the scene and overlapping fields of view may be used to predict relative positions of the image capture devices and thereby predict relative positions of audio inputs. In another implementation, multiple applications open for use as an audio input may be visually detected and used in determining relative position. In another variation, the relative position of the microphones may be approximated by a user. For example, a user may tap in a control application the locations of the used microphones. A user may similarly mark locations of potential audio sources (e.g., where the user and friends are sitting).

Block S120, which includes determining a multi-dimensional audio map from the audio inputs, functions to use audio triangulation and/or other audio processing techniques to approximate where particular components of an audio signal originated.

Determining a multi-dimensional audio map preferably creates a spatial mapping of the originating location of audio sounds that are recorded in the environment. Audio generally originates from some point or small region, as in the case of a person talking. The fidelity of location may be generalized to general regions in a space. The phase-shifted differences of audio sounds as recorded by different audio inputs can be translated to displacement and/or position information. Signal processing, machine learning, heuristic-based processing, and/or other techniques may be applied in the generation of an audio map.

In one implementation, determining a multi-dimensional audio map includes triangulating an at least approximate location of audio sounds identified in at least a subset of audio inputs. This may additionally include detecting audio features across multiple audio inputs and calculating the time differences. Audio features as used herein are detectable segments or components of sound that can be associated with an audio source. Audio features can be time-domain patterns, frequency domain patterns, and/or any other suitable analysis of an audio signal. As the method may encounter multiple overlapping audio sources, audio feature detection preferably works during interference. Shared audio features may additionally be identified in audio inputs with time delay phase shifts (except in scenarios such as when the audio source is equidistant from the microphones). The time delays can then be used, with the time of flight of sound, in triangulating the location.
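One illustrative (and deliberately brute-force) realization of this triangulation is a grid search over candidate map positions for the point whose predicted inter-microphone delays best match the measured ones; the grid extent, resolution, and speed-of-sound constant are assumptions of the sketch:

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s (assumed)

    def locate_source(mic_positions, measured_delays,
                      grid_extent=5.0, resolution=0.05):
        """mic_positions: (M, 2) array; measured_delays: (M,) arrival
        delays in seconds, measured relative to the first microphone."""
        mics = np.asarray(mic_positions, dtype=float)
        axis = np.arange(-grid_extent, grid_extent, resolution)
        best, best_err = None, np.inf
        for x in axis:
            for y in axis:
                d = np.linalg.norm(mics - (x, y), axis=1) / SPEED_OF_SOUND
                predicted = d - d[0]           # delays relative to first mic
                err = float(np.sum((predicted - measured_delays) ** 2))
                if err < best_err:
                    best, best_err = (x, y), err
        return best

Closed-form or least-squares multilateration solvers could replace the grid search; the search form is used here only to make the time-of-flight reasoning explicit.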

In some variations, supplemental sensing such as CV-based monitoring of people and other audio sources can be used in generating the audio map and/or in assisting the generation of an audio map. In one alternative variation, the audio map could be entirely based on potential audio sources. For example, a CV-based monitoring system may identify the people in the room and their locations, and those could be used as potential audio sources. Each person could be assigned an audio control input, and if audio is detected originating from the tracked location of that person, then the audio control input assigned to that person could be used.

Block S130, which includes acquiring a set of positional audio control inputs applied to the audio map, functions to set how audio is to be modified based on the positional source of different sounds. The set of audio control inputs are preferably used in collectively defining how different audio “channels” from different locations are used in generating an audio output in block S140. A positional audio control input preferably has a location property, optionally a dimension to define a region around the location, and optionally one or more audio processing properties that characterize different audio modifications or computing operations to be made in association with audio emanating from that location. A default audio processing property may specify no additional processing beyond extraction. An audio control input may additionally be associated with an object identified through the audio map, image data, or other suitable data. The object association may enable the audio control input to maintain association as the object is tracked. In such an object-associated implementation, the audio control input may or may not include a location property.

The audio control inputs may be acquired or set in a variety of configuration modes such as a manual mode, an automatic mode, a semi-automatic mode, a CV-based monitoring mode, an agent sensing mode, and/or any suitable mode of configuration. Different audio control inputs may be set using different configuration modes.

In a manual configuration mode variation, a user can use a representation of the multi-dimensional map of audio sources to set one or more positional audio control inputs. Accordingly, the method may additionally include presenting a representation of the audio map in a user interface S122 and acquiring a set of audio control inputs at least partially through interactions with the representation of the audio map S131 as shown in the exemplary implementation of FIG. 9. The manual configuration mode is preferably facilitated through an application or device with a graphical user interface, though other suitable mediums of a user interface, including a programmatic user interface (e.g., an API), may also be used.

Presenting a representation of the multi-dimensional map of audio sources S122 functions to communicate the relative position of various audio sources to a user. Preferably, a two- or three-dimensional graphical representation of the audio landscape is presented. In one variation, the live audio signal can be animated in the graphical representation. In another variation, the graphical representation may only mark potential locations of audio sources. While presenting the audio map may be used for setting audio control inputs, the representation of the audio map may additionally or alternatively facilitate other uses. For example, a phone conferencing device could use a generated graphical representation of speakers sharing a phone for the benefit of other conference members who are listening remotely.

Setting of an audio control input through interactions with a representation of the audio map can involve a user selecting a location and/or region on a graphical representation and thereby setting the location property of an audio control input. In one implementation, a graphical representation of the audio map can be presented within an application. An audio control input can be created to be associated with the selected location and/or region, and then the user can set various audio processing properties. For example, a user could set the volume, mute the audio source, select sound effects, set additional processing (e.g., recording or saving a transcription), and the like.

In an automatic configuration mode variation, audio control inputs may be automatically set in part or in whole. Automatic configuration can be used to set a location property and/or an audio processing property. Additionally, automatic configuration may be used to dynamically update the audio control inputs.

In one variation, acquiring at least one positional audio control input includes automatic detection of audio source locations and setting of audio control inputs at the detected audio source locations S132 as shown in FIG. 10. Automatic detection of an audio source can be achieved by applying a classifier on the audio map to detect probable audio sources. Characteristic of an audio source will generally be consistent generation of audio of similar properties from a focused area. The location property can additionally be updated so that the source of audio can be tracked as the source moves.

In another variation, one or more audio processing operators may be automatically set for an audio control input without user input. The manner of automatic setting can take a variety of forms such as audio source classification, audio content analysis, contextual analysis, and/or other approaches.

Audio sources may be classified based on the type of audio generated from that location or through other classification approaches (e.g., visual classification), and that classification can be used to set preconfigured audio processing operators. For example, a human speaker may receive an operator to enhance speech audio, while a loudspeaker playing music may receive an operator to balance music audio.

Audio content analysis may additionally be used in determining the settings of audio control inputs. Acquiring an audio control input in one instance may include applying audio content analysis on audio isolated from the audio source location and setting an audio processing property of the audio control input based on the audio content analysis S133 as shown in FIG. 11. The audio content analysis can include performing speech to text, performing sentiment analysis, content analysis on the spoken words, and/or other suitable forms of content analysis. Achieving content analysis may involve initially extracting an audio source so that the content of the audio source can be analyzed.
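A sketch of this S133 variation follows; transcribe and classify are stand-in placeholders for an unspecified speech-to-text engine and a content-significance classifier, and the gain values are arbitrary illustrative choices (gain() as sketched earlier):

    def set_operators_from_content(control_input, channel, transcribe, classify):
        """Inspect the isolated channel's content and set the audio
        processing property of its control input accordingly."""
        text = transcribe(channel)               # speech -> text (placeholder)
        if text and classify(text) == "significant":
            control_input.operators = [gain(6.0)]      # enhance the source
        else:
            control_input.operators = [gain(-20.0)]    # push to the background
        return control_input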

As a related variation of automatic content analysis, contextual analysis may use various inputs that indicate the situation. In one example, if the user sets the system to amplify a speaker near him and mute sounds coming from behind him, the system may automatically detect that the user is likely having a conversation and can decrease the audio of audio sources other than the speakers near him.

Content analysis as well as contextual setting of an audio control input may be used for audio control inputs that were initialized or that were in some part set by a user.

In a CV-based monitoring mode variation, audio control inputs may be at least partially set and/or updated in response to CV-based modeling of image data collected from the environment. Accordingly, the method may include collecting image data and applying computer vision interpretation of the image data S150; and at least partially setting a positional audio control input in response to the computer vision interpretation S134 as shown in FIG. 12. The image data used in interpreting the environment is preferably image data of at least a portion of the environment covered by the audio map. Imaging data can be collected from one or more statically fixed imaging devices. Such an imaging device will generally have a third person view of an agent (e.g., a person, machine, etc.). Imaging data may alternatively be collected from a personal imaging device that is coupled to the agent as shown in FIG. 13. In this example, the imaging data may not be a visual observation of the agent, but can reflect the view of the agent (e.g., what the agent is viewing or what may be viewable by the agent).

CV-based modeling of image data can include performing object classification and/or identification, object tracking, biometric identification of a person, gesture detection, event detection, interaction detection, extraction of information from a device interface source, scene description, 2D or spatial modeling, and/or any suitable form of image analysis. Preferably, the CV-based modeling can detect people and optionally other objects (in particular those that may generate sound like a speaker system, a television, a telephone, a machine, etc.). Detected objects can preferably be mapped to audio sources of an audio map.

In one particular variation, CV-based modeling is used to detect the focus of a person and update the set of audio control inputs so that an audio output generated for that person corresponds to the direction of the person's attention. Accordingly, applying computer vision interpretation can include detecting direction of attention by an agent and updating the audio control inputs in response to the direction of attention by the agent S135 as shown in FIG. 12. The agent is preferably also the same subject for whom the audio output is generated, which may be used to customize audio mixing to include audio sources that a subject focuses on. Detecting the direction of attention can include detecting body position (e.g., direction of the front of the body, the head, or the eyes). In one preferred implementation, gaze analysis may be performed to detect the direction of a person's visual focus. More generally, direction of attention may be used to narrow the audio sources of interest to the audio sources in the “field of view” of the agent. When the imaging device is part of a wearable computing device (e.g., smart glasses), detection of an audio source in the image data can be used to detect which audio sources should be activated.
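A possible sketch of this attention narrowing is below; the field-of-view angle and the direct use of the gaze-alignment cosine as a weight are assumptions of the example:

    import numpy as np

    def attention_weights(agent_position, gaze_direction, source_locations,
                          field_of_view_deg=60.0):
        """Weight each audio source by how closely it aligns with the
        agent's gaze; sources outside the field of view get zero."""
        gaze = np.asarray(gaze_direction, dtype=float)
        gaze = gaze / np.linalg.norm(gaze)
        threshold = np.cos(np.radians(field_of_view_deg / 2.0))
        weights = []
        for loc in source_locations:
            to_source = np.asarray(loc, dtype=float) - agent_position
            to_source = to_source / np.linalg.norm(to_source)
            alignment = float(np.dot(gaze, to_source))  # cosine of angle to gaze
            weights.append(alignment if alignment >= threshold else 0.0)
        return weights

The resulting weights could scale the volume of each audio source channel or gate which sources are incorporated into the audio output.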

The subject of attention may also impact how the audio control input is set. In one variation, different audio sources may be assigned different priorities. For example, an audio source of higher priority may be incorporated in the audio output even when on the periphery of the field of view, while an audio source with a lower priority may be limited to incorporation into the audio output only when it is a centered focus of an agent's attention.

In a similar variation, CV-based modeling of an agent's direction of attention may be applied across multiple agents in the environment. The audio control inputs can be updated in response to the collective direction of attention of multiple agents. This may be done when mixing the audio for a presentation: when most of the attendees are watching the speaker, that audio may be amplified as shown in FIG. 14A. However, if a large number of attendees direct their attention to a fellow attendee, then that attendee may be amplified (e.g., if someone asks a question during a presentation) as shown in FIG. 14B. The different agents may additionally be ranked or otherwise prioritized when collectively analyzing the attention of a group of agents.

In a related agent orientation sensing variation, audio control inputs may be at least partially set and/or updated in response to the agent environment orientation of one or more agents. Accordingly, the method may include collecting agent environment orientation from a sensing device S160; and at least partially setting a positional audio control input in response to agent orientation within the environment compared to the audio map S135 as shown in FIG. 15. Here, environment orientation can include agent position within the environment and/or agent direction/orientation (e.g., direction of the front of the body, angle of the head, etc.). As above, the agent is preferably a person but may be any suitable entity. Collecting agent environment orientation can include sensing position using GPS, local positioning, Bluetooth beaconing, CV-based position tracking, and/or any suitable form of positioning. For example, an application on a smart phone or smart wearable may supply position data of the wearer. Collecting agent environment orientation may additionally or alternatively include sensing or collection of agent directional orientation, which can include where the person is facing, the direction of the head, and the like. For example, smart glasses equipped with an inertial measurement unit (IMU) that includes an accelerometer, a digital gyroscope, and/or a magnetometer may be used in sensing the angle and direction of the smart glasses. The variations of the CV-based monitoring mode may similarly be applied. For example, based on the environment orientation, the method may determine the audio sources in near proximity and/or in the direction of attention of a person, and appropriately set the audio processing properties for audio sources of interest. This variation may be used in generating audio output for one person (e.g., the agent being monitored for environment orientation). This variation may alternatively be used in generating audio output based on the environment orientation of multiple agents and determining some form of group consensus.
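For the proximity portion, a minimal sketch might set per-source gains from the agent's sensed position; the near radius and attenuation values are illustrative assumptions:

    import numpy as np

    def proximity_gains_db(agent_position, source_locations,
                           near_radius=2.0, far_attenuation_db=-30.0):
        """Leave sources within near_radius meters of the agent at unity
        gain and attenuate (effectively mute) everything farther away."""
        gains = []
        for loc in source_locations:
            distance = np.linalg.norm(np.subtract(loc, agent_position))
            gains.append(0.0 if distance <= near_radius else far_attenuation_db)
        return gains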

The various modes of configuration may be used individually, in parallel (e.g., different audio control inputs set based on different configuration modes), and/or in combination. For example, a set of audio control inputs may be partially set by audio content analysis and CV-based monitoring. In particular, CV-based monitoring may be used to determine primary audio sources of interest based on direction of attention, and, of these audio sources, audio content analysis can determine which are producing audio of interest to a subject.
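
A compact sketch of one such combination follows; extract_audio and is_speech_of_interest are hypothetical injected callables standing in for the extraction and content-analysis stages.

    def combine_modes(candidate_ids, extract_audio, is_speech_of_interest):
        """candidate_ids: sources already narrowed by CV-based attention.
        Returns the subset whose extracted audio the content analysis
        flags as being of interest to the subject."""
        return [sid for sid in candidate_ids
                if is_speech_of_interest(extract_audio(sid))]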

Block S140, which includes generating an audio output according to the positional audio control inputs and the audio inputs, functions to change or augment the audio input to approximate the requests conveyed through the audio control inputs. In effect, the audio control inputs treat the different locations of the audio map as audio channels that can be mixed, processed, or otherwise used in producing some result.

The generating of an audio output preferably involves the processing or extraction of audio sources, if not already processed or extracted, and then the collective processing of multiple audio sources (e.g., mixing them together into a single audio output).

Accordingly, generating the audio output can include, for an audio control input, at least partially extracting an audio source from the audio map as indicated by the location property of the audio control input and applying an audio transformation indicated by the audio processing property of the audio control input. Extracting an audio source preferably involves the collaborative use of multiple audio inputs to enhance sounds originating from the location while possibly deemphasizing or quieting sounds in the audio input not originating from the location. The extraction preferably functions to substantially isolate an audio source, which is an audio signal representative of audio from that location. The extraction is in some ways like generating a virtual audio recording from that location (as opposed to the actual audio recordings at the real locations of the audio inputs). Extracting the audio source in one preferred implementation can include applying phased alignment processing of at least a subset of the audio inputs (e.g., two or more). The phased alignment processing is preferably configured to substantially isolate audio at the location property. The phased alignment processing preferably accounts for the displacements between the microphones used to record or sense the audio inputs, and applies a time shift to the set of audio inputs to account for the time displacements, which functions to substantially align in time the audio features originating from that location. Then the audio inputs can be combined to reinforce the aligned audio features. Audio originating from audio sources outside of the targeted location is preferably not reinforced and may be actively minimized so that the resulting output for an audio source reflects some measure of isolation of the audio source. For example, extracting an audio source of a person speaking preferably results in an audio source data representation in which other contributors to sound in the environment are minimized and the speaker is more clearly emphasized.
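
The paragraph above describes what is commonly realized as delay-and-sum beamforming. The following is a minimal sketch under simplifying assumptions (nearest-sample shifts, free-field propagation, a 2-D map); a practical implementation would use fractional-delay interpolation and zero-padding rather than the wrap-around shift used here.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s, assumed room temperature

    def delay_and_sum(signals, mic_positions, target, sample_rate):
        """signals: equal-length 1-D arrays, one per microphone
        mic_positions, target: (x, y) coordinates in meters."""
        target = np.asarray(target, dtype=float)
        delays = [np.linalg.norm(np.asarray(m, dtype=float) - target) / SPEED_OF_SOUND
                  for m in mic_positions]
        min_delay = min(delays)
        aligned = []
        for sig, delay in zip(signals, delays):
            # Advance each input by its extra propagation delay so that
            # features originating at the target location line up in time.
            shift = int(round((delay - min_delay) * sample_rate))
            aligned.append(np.roll(sig, -shift))  # padding preferred in practice
        # Averaging reinforces the aligned features; off-target audio
        # combines incoherently and is relatively attenuated.
        return np.mean(aligned, axis=0)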

Generating the audio output can additionally include applying the audio processing operators to the audio sources. Various techniques may be used in amplifying, deamplifying, equalizing, compressing, and/or applying audio effects such as filtering, changing speed, shifting pitch, and the like. Finally, the audio output can be a combination of multiple audio sources. For example, an implementation configured to monitor the speech of two speakers will generate two audio sources at the locations of the speakers and then combine the audio source data representations to form one audio output.
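
A hedged sketch of this mixing stage, assuming a simple per-source gain as the audio processing property (the property names are illustrative):

    import numpy as np

    def mix_sources(sources, properties):
        """sources:    dict source_id -> 1-D float array (extracted audio)
        properties: dict source_id -> {'gain': float}"""
        mix = sum(properties[sid].get('gain', 1.0) * sig
                  for sid, sig in sources.items())
        # Simple peak normalization to avoid clipping in the combined output.
        peak = np.max(np.abs(mix))
        return mix / peak if peak > 1.0 else mix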

The audio output can be any suitable form of audio output such as a mono-channel audio output, a stereo audio output, a 3D audio output, or any other suitable form of audio output. For example, head-related transfer functions may be used in combination with audio source location to preserve the positional aspects of the audio. Such audio effects can similarly be made to depart from reflecting true life and can be set to have any suitable effect.
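
HRTF rendering itself requires measured filter data; as a crude, assumption-laden stand-in, the sketch below preserves only an interaural level difference via constant-power panning from the source's azimuth.

    import numpy as np

    def pan_to_stereo(signal, azimuth_deg):
        """azimuth_deg: source direction, -90 (hard left) to +90 (hard right)."""
        pan = np.clip(azimuth_deg, -90.0, 90.0) / 90.0  # map to -1..1
        theta = (pan + 1.0) * np.pi / 4.0               # map to 0..pi/2
        left, right = np.cos(theta), np.sin(theta)      # constant-power law
        return np.stack([left * signal, right * signal], axis=0)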

The method is preferably used for producing audio for one or more subjects. The subjects are generally people. In one implementation, the audio output is produced for a particular subject. In the case of CV-based configuration or orientation-sensing configuration, the subject associated with the audio output will generally also be the monitored agent during configuration. The audio can be played in real time, but may additionally be recorded for later playback. A person may use headphones, a hearing aid, a speaker system, or any suitable system to listen to the audio output. The audio output could similarly be generated and played for a number of people. For example, when used for managing audio during a conference call or a presentation, the audio can be played over a speaker system intended for multiple people.

In one variation, the audio output is played by a listening device. As discussed above, one implementation may use the presentation of an audio map within a user interface of a listening application. The listener can use the application to set and update the audio control inputs. Furthermore, such an implementation may synchronize and use multiple application instances. Configuration of audio control inputs could be shared across multiple people. Additionally, multiple application instances may facilitate the acquisition of audio inputs. For example, someone using a listening device during a meeting may ask others to connect their listening applications so that a microphone array can be generated ad hoc to implement the method, allowing the listener to hear the conversation better.
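
One practical hurdle in such an ad-hoc array is that independently recording devices share no common sample clock. A common remedy, sketched here under the assumption of overlapping recorded content, is to estimate the offset by cross-correlation before the audio map is built:

    import numpy as np

    def estimate_offset(reference, other):
        """Return the sample offset that best aligns `other` to `reference`
        (positive means `other` lags the reference)."""
        corr = np.correlate(other, reference, mode='full')
        return int(np.argmax(corr)) - (len(reference) - 1)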

In an alternative variation, the audio output may not be directly played for the purpose of generating sound. The audio output can be used for other purposes such as record keeping, input to a computer system, and/or other purposes. In one preferred variation, the method includes communicating the audio output to an audio-based user interface system as shown in FIG. 16. An audio-based user interface system is generally a computing system that uses spoken input to direct actions of a computing device. An audio-based user interface can be a digital personal assistant or any suitable voice interface. In one particular implementation, the identification of different audio sources and the custom extraction of distinct audio sources can be used in a crowded environment so that one system may be used in collecting and managing voice commands from different people. For example, in a store employing this method, possibly in combination with some CV-based application (e.g., automatic self-checkout), the different customers could issue voice commands that can be extracted, uniquely associated with each customer, and used by the customer to direct system interactions. In another variation, this can be used to follow voice commands only of select people with privileges to issue voice commands. For example, a CV-monitoring system can identify three people as having control over a set of network devices (e.g., lights in a room, a sound system, a presentation system, etc.), and the method is then used so that only audio extracted from those three people is treated as voice commands.
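
A routing sketch for the privileged-speaker example follows; the identity mapping is assumed to come from the CV-monitoring component, and transcribe is a hypothetical speech-to-text callable.

    PRIVILEGED = {'person_a', 'person_b', 'person_c'}  # illustrative IDs

    def route_voice_commands(extracted_sources, identities, transcribe):
        """extracted_sources: dict source_id -> audio array
        identities: dict source_id -> person_id (from CV association)"""
        commands = []
        for source_id, audio in extracted_sources.items():
            person = identities.get(source_id)
            if person in PRIVILEGED:
                # Only privileged speakers reach the audio-based interface.
                commands.append((person, transcribe(audio)))
        return commands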

Another capability of the method is generating multiple distinct audio outputs from shared audio inputs. Preferably, a second set of audio control inputs can be acquired using any of the variations described above, and a second audio output can be generated using the second set of audio control inputs and the audio inputs. In this way, one audio output may be generated to facilitate one person listening to two particular speakers, and a second audio output may be generated to facilitate another person listening to three different speakers. As the method facilitates the synthesizing of different audio-source-based channels, these channels can be mixed in different ways for different audiences or purposes. These audio outputs can be played or, as discussed above, used for other purposes such as input for an audio-based user interface system.
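
Reusing the mix_sources helper sketched earlier, two listener-specific outputs from the same extracted sources might be configured as follows (speaker IDs and gains are purely illustrative):

    # Listener 1 hears speakers A and B; listener 2 hears speakers B and C.
    controls_listener_1 = {'speaker_a': {'gain': 1.0},
                           'speaker_b': {'gain': 1.0},
                           'speaker_c': {'gain': 0.0}}
    controls_listener_2 = {'speaker_a': {'gain': 0.0},
                           'speaker_b': {'gain': 1.0},
                           'speaker_c': {'gain': 1.0}}
    # output_1 = mix_sources(extracted, controls_listener_1)
    # output_2 = mix_sources(extracted, controls_listener_2)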

The systems and methods of the embodiments can be embodied and/or implemented at least in part as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions can be executed by computer-executable components integrated with the application, applet, host, server, network, website, communication service, communication interface, hardware/firmware/software elements of a user computer or mobile device, wristband, smartphone, or any suitable combination thereof. Other systems and methods of the embodiment can be embodied and/or implemented at least in part as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions can be executed by computer-executable components integrated with apparatuses and networks of the type described above. The computer-readable instructions can be stored on any suitable computer-readable media such as RAMs, ROMs, flash memory, EEPROMs, optical devices (CD or DVD), hard drives, floppy drives, or any suitable device. The computer-executable component can be a processor, but any suitable dedicated hardware device can (alternatively or additionally) execute the instructions.

As a person skilled in the art will recognize from the previous detailed description and from the figures and claims, modifications and changes can be made to the embodiments of the invention without departing from the scope of this invention as defined in the following claims.

We claim:
1. A method comprising: receiving multiple audio inputs from a set of distinct locations; determining a multi-dimensional audio map from the audio inputs; acquiring a set of positional audio control inputs applied to the audio map, each audio control input comprising a location and audio processing property; and generating an audio output according to the audio control inputs and the audio inputs.
2. The method of claim 1, wherein generating the audio output comprises, for at least one instance of the audio control inputs, at least partially extracting an audio source from the audio map as indicated by the location property and applying an audio transformation indicated by the audio processing property.
3. The method of claim 2, wherein generating the audio output further comprises combining at least two audio sources and thereby generating the audio output.
4. The method of claim 2, wherein extracting an audio source from the audio map comprises applying phased alignment processing of at least a subset of the audio inputs, the phased alignment processing configured to isolate audio at the location property.
5. The method of claim 1, further comprising presenting a representation of the audio map in a user interface; wherein acquiring the set of audio control inputs is at least partially defined through interactions with the representation of the audio map.
6. The method of claim 5, wherein acquiring an audio control input defined through interactions with the representation of the audio map comprises setting a location property of the audio control input through selection of a location on the representation of the audio map and setting an audio processing property within the user interface.
7. The method of claim 1, wherein acquiring at least one audio control input comprises automatic detection of audio source locations and setting audio control inputs at the detected audio source locations.
8. The method of claim 1, wherein acquiring at least one audio control input of the set of audio control inputs further comprises applying audio content analysis on audio extracted from the audio source location and setting an audio processing property of the audio control input based on the audio content analysis.
9. The method of claim 1, further comprising communicating the audio output to an audio-based user interface system.
10. The method of claim 1, wherein the audio output is generated in association with a first person; and further comprising acquiring a second set of positional audio control inputs applied to the audio map; and generating a second audio output in association with a second person, the second audio output being generated according to the second set of audio control inputs and the audio inputs.
11. The method of claim 1, wherein receiving multiple audio inputs comprises recording audio from a distributed set of personal computing devices and communicating the audio input from each personal computing device to a processing system.
12. The method of claim 1, wherein receiving multiple audio inputs comprises recording audio from a microphone array comprised of distinct microphones positioned within an environment.
13. The method of claim 1, further comprising playing the audio output in a personal listening device.
14. The method of claim 13, further comprising presenting a representation of the audio map in a user interface of an instance of a listening application.
15. The method of claim 14, wherein receiving multiple audio inputs comprises a set of associated application instances each receiving an audio input.
16. The method of claim 1, further comprising collecting image data; applying computer vision interpretation of the image data; and at least partially setting an audio control input in response to the computer vision interpretation.
17. The method of claim 16, wherein applying computer vision interpretation comprises detecting direction of attention by a person, and updating the audio control inputs in response to the direction of attention by the person.
18. The method of claim 17, wherein the audio output is played on a speaker system associated with the person.
19. The method of claim 16, further comprising detecting direction of attention by multiple people, and updating audio control inputs in response to collective direction of attention by the multiple people.
20. The method of claim 16, further comprising isolating an audio source from at least one location as indicated by an audio control input; applying audio content analysis on the audio source; and wherein an audio control input is additionally partially set in response to the audio content analysis.